Abstract. This document describes the participation of Vienna University
of Technology in the TREC Clinical Decision Support Track 2014.
Four different search models are investigated, as well as different strategies
to index the corpus and to extract the most relevant information
from the topics. Our results show that BM25 and the Vector Space Model
had similar performance for P@10 and inferred NDCG.
1 Introduction

Searching for health information has become a common task. The Pew Research
Center estimates that 80% of the American population uses the Web to seek
health information [2]. In line with this trend, various health-related
evaluation campaigns have been proposed. Some examples are the TREC Genomics
Track [7], which ran from 2003 to 2007, the TREC Medical Records Track [9],
which ran in 2011 and 2012, the ImageCLEFmed Track on medical image retrieval
[4,5], which ran between 2003 and 2013, and the ShARe/CLEF eHealth Evaluation
Lab [8,3], which ran in 2013 and 2014. Here we briefly describe the goals of
the first TREC Clinical Decision Support Track (TREC-CDS) and the
participation of Vienna University of Technology.

The TREC-CDS is focused on physicians searching for information relevant
to patient care. As its document collection, it uses the open access subset of
PubMed Central (PMC), containing a total of 733,138 articles. The topics are
divided into three main types: diagnosis, test and treatment. Figure 1 shows a
diagnosis query.

As there was no development set available, we decided to experiment with
different search models and indexing possibilities, trying to build an initial
foundation for our participation next year.

Our  Contribution 

In  this  paper,  we  experiment  and  evaluate  a  large  variety  of  search  models  and 
indexing  strategies,  as  well  as  ways  of  combining  different  models  and  indexes. 
<topic number="8" type="diagnosis">
<description>
A 62-year-old man sees a neurologist for progressive memory loss and
jerking movements of the lower extremities. Neurologic examination
confirms severe cognitive deficits and memory dysfunction. An
electroencephalogram shows generalized periodic sharp waves.
Neuroimaging studies show moderately advanced cerebral atrophy.
A cortical biopsy shows diffuse vacuolar changes of the gray matter
with reactive astrocytosis but no inflammatory infiltration.
</description>
<summary>
62-year-old man with progressive memory loss and involuntary leg
movements. Brain MRI reveals cortical atrophy, and cortical biopsy
shows vacuolar gray matter changes with reactive astrocytosis.
</summary>
</topic>

Fig.  1:  Example  of  a  diagnosis  query 


2  Experiments 

In our experiments, we explore several search techniques and IR systems, as
well as different indexing strategies. This section describes all the
configurations used in detail. In Section 2.1, we explain how we create three
varieties of index using the MeSH hierarchy. Thereafter, in Section 2.2, we
explain our query formulation method, where we make use of Metamap to retain
only the most important concepts from each topic. In Sections 2.3, 2.4 and 2.5,
we briefly explain the 3 different IR systems that we use for our runs: Run1
(Indri), Run2 (Lucene), and Run3 (Xapian). For each system, we generate 6 runs:
a combination of the 3 indexing methods from Section 2.1 and the 2 query
strategies from Section 2.2. We merge the scores of these 6 runs into a final
run for each system. For Run4, we combine the documents from the previous 3
runs, as we explain in Section 2.6. Finally, we explore Word2Vec in our Run5,
explained in Section 2.7.


2.1  Indexing  Concepts 

We take advantage of the Medical Subject Headings (MeSH¹) hierarchy to keep
only the important concepts of each document in the collection. MeSH has a
hierarchical structure for a set of terms named descriptors, as shown in
Figure 2. The hierarchical structure makes it possible to narrow the scope of
the terms. It is updated every year and the 2014 version has 27,149
descriptors.

Based on the MeSH hierarchy, we create 3 types of indexes:

1. All words: we index the documents as they are, without removing any word;

¹ http://www.nlm.nih.gov/mesh/MBrowser.html


1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
     Bacterial Infections and Mycoses [C01]
     Virus Diseases [C02]
     Parasitic Diseases [C03]
     Neoplasms [C04]
     Musculoskeletal Diseases [C05]
     Digestive System Diseases [C06]
     Stomatognathic Diseases [C07]
     Respiratory Tract Diseases [C08]
     Otorhinolaryngologic Diseases [C09]
     Nervous System Diseases [C10]
     Eye Diseases [C11]
     Male Urogenital Diseases [C12]
     Female Urogenital Diseases and Pregnancy Complications [C13]
     Cardiovascular Diseases [C14]
     Hemic and Lymphatic Diseases [C15]
     Congenital, Hereditary and Neonatal Diseases and Abnormalities [C16]
     Skin and Connective Tissue Diseases [C17]
     Nutritional and Metabolic Diseases [C18]
     Endocrine System Diseases [C19]
     Immune System Diseases [C20]
     Disorders of Environmental Origin [C21]
     Animal Diseases [C22]
     Pathological Conditions, Signs and Symptoms [C23]
     Occupational Diseases [C24]
     Chemically-Induced Disorders [C25]
     Wounds and Injuries [C26]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Phenomena and Processes [G]
8. Disciplines and Occupations [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology, Industry, Agriculture [J]
11. Humanities [K]
12. Information Science [L]
13. Named Groups [M]
14. Health Care [N]
15. Publication Characteristics [V]
16. Geographicals [Z]


Fig.  2:  MeSH  hierarchy  with  the  disease  branch  expanded 


2. MeSH vocabulary: we exclude all words in a document that are not present
in the MeSH hierarchy;

3. MeSH-CD: we exclude all words in a document that are not present in
branch C (Diseases) or branch D (Chemicals and Drugs) of the MeSH hierarchy.

We use all 3 indexes for runs 1, 2, 3 and 4, and only MeSH-CD for run 5, as
described in Table 1. For all runs, a script that lowercases the text and
removes punctuation is also used.
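
The three index variants can be read as a vocabulary filter applied to each document before indexing. The following is a minimal sketch under stated assumptions, not the script used for the runs: `MESH_TERMS` and `MESH_CD_TERMS` are hypothetical, hand-made stand-ins for word sets extracted from the MeSH hierarchy, and tokenization is a plain whitespace split.

```python
import string

# Hypothetical toy vocabularies; real ones would be extracted from MeSH.
MESH_TERMS = {"atrophy", "memory", "biopsy", "brain"}
# Words assumed to come from branches C (Diseases) and D (Chemicals and Drugs).
MESH_CD_TERMS = {"atrophy"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def index_tokens(text, vocabulary=None):
    """Return the tokens to index; with vocabulary=None all words are kept."""
    tokens = normalize(text)
    if vocabulary is None:
        return tokens
    return [t for t in tokens if t in vocabulary]


doc = "Brain MRI reveals cortical atrophy, and memory loss."
all_words = index_tokens(doc)                # "All words" index
mesh_voc = index_tokens(doc, MESH_TERMS)     # "MeSH vocabulary" index
mesh_cd = index_tokens(doc, MESH_CD_TERMS)   # "MeSH-CD" index
```

Each stricter vocabulary yields a shorter token list for the same document, which is why the MeSH-CD index is the smallest of the three.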


                                       Indexing Variants               Query Variants
Run   Model               Search Eng.  All Words  MeSH voc.  MeSH-CD   Whole Desc.  Metamap Filter
Run1  Language Model      Indri        ✓          ✓          ✓         ✓            ✓
Run2  Vector Space Model  Lucene       ✓          ✓          ✓         ✓            ✓
Run3  BM25                Xapian       ✓          ✓          ✓         ✓            ✓
Run4  Combo               -            ✓          ✓          ✓         ✓            ✓
Run5  Word2Vec            -            -          -          ✓         -            ✓


Table  1:  Summary  description  of  all  5  runs 


2.2  Selecting  Terms  in  the  Topics 

We employ NLM’s Metamap (version 2013) with default processing options [1]
to annotate all the topics. Metamap maps the topics to UMLS concepts and
semantic types. There are a total of 133 semantic types, but some of them
(e.g., Clinical Drug or Disease or Syndrome) are more important than others
in our experiments². For example, the first sentence in the description part
of Figure 1 is: “A 62-year-old man sees a neurologist for progressive memory
loss and jerking movements of the lower extremities”, from which Metamap
identifies concepts such as:

- Concept: /year (per year) - Semantic type: Temporal Concept
- Concept: Old - Semantic type: Temporal Concept
- Concept: MAN (Male gender) - Semantic type: Finding
- Concept: sees (Vision) - Semantic type: Organism Function
- Concept: Lower (spatial qualifier) - Semantic type: Spatial Concept

We automatically kept only the concepts whose semantic types are related to
symptoms, diseases or treatments (based on [6]): man, memory loss, jerking
movements, and lower extremities.

For each topic, we can:

1. use the description of the topic as the query;

2. use only the keywords related to symptoms, diseases or treatments, as
identified by the Metamap semantic types.

For runs 1, 2, 3 and 4, we generated runs using both possibilities. For Run5,
we generated runs using only the second possibility.
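
The second query strategy amounts to filtering annotated concepts by semantic type. The sketch below is illustrative only: the `(concept, semantic type)` pairs are hand-written stand-ins rather than real Metamap output, and `KEPT_SEMANTIC_TYPES` is a hypothetical whitelist, since the exact set of retained types is not listed here.

```python
# Hypothetical whitelist of semantic types related to symptoms, diseases
# or treatments; the set actually used is an assumption of this sketch.
KEPT_SEMANTIC_TYPES = {"Sign or Symptom", "Disease or Syndrome", "Clinical Drug"}


def filter_concepts(annotations, kept_types=KEPT_SEMANTIC_TYPES):
    """Keep the text of concepts whose semantic type is whitelisted, in order."""
    return [text for text, sem_type in annotations if sem_type in kept_types]


# Hand-written (concept, semantic type) pairs standing in for Metamap output.
annotations = [
    ("old", "Temporal Concept"),
    ("memory loss", "Sign or Symptom"),
    ("sees", "Organism Function"),
    ("lower", "Spatial Concept"),
]
query = " ".join(filter_concepts(annotations))
```

Concepts with temporal, spatial or functional types are dropped, so only the clinically informative terms reach the search engine.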

2.3 Run1

Run1 was based on Indri³. Indri is a search engine from the Lemur project,
mainly based on Language Modeling as its retrieval model. We used only the
#combine operator in our experiments. Six runs were generated: three different
indexing strategies combined with two different ways to formulate the queries.
The runs were combined by simply adding the scores for each document.
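
Merging by score addition can be sketched as follows. This is a minimal illustration under assumptions: each run is represented as a hypothetical dictionary from document id to score, documents missing from a run contribute nothing, and ties are broken by document id.

```python
from collections import defaultdict


def sum_scores(runs):
    """Merge runs by adding each document's scores across all runs."""
    merged = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            merged[doc_id] += score
    # Sort by descending merged score, breaking ties by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Two toy runs with made-up scores.
runs = [
    {"doc_a": 2.0, "doc_b": 1.0},
    {"doc_b": 2.5, "doc_c": 0.5},
]
ranking = sum_scores(runs)
```

Note that raw scores from differently configured runs may live on different scales, so a run with larger scores can dominate the sum.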

2.4  Run2 

Lucene⁴ is a text search engine written in Java and supported by the Apache
Software Foundation. The default search model of Lucene is the Vector Space
Model (VSM), and it was used with the default parameters. As for Run1, six
runs were generated and combined by summing the scores of each individual
document.

2.5 Run3

Xapian⁵ is also an open source search engine. It is written in C++ and uses
the BM25 weighting scheme by default. The scores of the six runs created were
also summed for each document.

² A complete list of every semantic type can be found here:
  http://metamap.nlm.nih.gov/SemanticTypesAndGroups.shtml
³ http://www.lemurproject.org/indri/
⁴ http://lucene.apache.org/
⁵ http://xapian.org/


2.6  Run4 


Our Run4 is the combination of all the previous runs. However, instead of using
the raw scores provided by the systems, we scored each document by the
reciprocal of its position in each run (1/position).
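
A rank-based combination of this kind can be sketched as below. The sketch assumes that the 1/position scores are summed across runs and that a document absent from a run contributes zero; both are assumptions of this illustration.

```python
from collections import defaultdict


def reciprocal_rank_merge(rankings):
    """Score each document by the sum of 1/position over the rankings it is in."""
    merged = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            merged[doc_id] += 1.0 / position
    # Sort by descending merged score, breaking ties by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Three toy rankings, best document first.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a"],
    ["doc_b", "doc_d"],
]
fused = reciprocal_rank_merge(rankings)
```

Because only positions are used, the systems' incompatible score scales no longer matter, which limits the damage a single weak run can do to the combination.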

2.7 Run5

Word2Vec⁶ provides vector representations of words learned with a neural
network. We had to compare each word in the query with each word in the
documents, in a quadratic procedure. Because this is computationally
expensive, we used only the smallest index (MeSH-CD) and the Metamap strategy
for building the queries.
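
The quadratic word-by-word comparison can be illustrated with a small sketch. Everything specific here is an assumption: the 2-dimensional vectors in `VECTORS` are toy values rather than trained Word2Vec embeddings, and the aggregation (averaging, over query words, the best cosine match in the document) is one plausible choice, not necessarily the one used for Run5.

```python
import math

# Toy 2-dimensional word vectors; real ones would come from a trained model.
VECTORS = {
    "dementia": (1.0, 0.0),
    "memory": (0.8, 0.6),
    "fracture": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score(query_words, doc_words, vectors=VECTORS):
    """Average, over query words, of the best cosine match in the document.

    Every query word is compared with every document word, so the cost is
    O(len(query_words) * len(doc_words)) per document.
    """
    q = [w for w in query_words if w in vectors]
    d = [w for w in doc_words if w in vectors]
    if not q or not d:
        return 0.0
    best = [max(cosine(vectors[a], vectors[b]) for b in d) for a in q]
    return sum(best) / len(q)


related = score(["dementia"], ["memory", "fracture"])
unrelated = score(["dementia"], ["fracture"])
```

The nested comparison is what makes short queries and a small index attractive for this run.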

3  Results 

We detail the results for all 30 topics in Figure 3. There were some very
difficult topics, such as 3, 9, 23 and 25, for which more than 50% of all
participant systems could not find a single relevant document in the top 10.
For other topics, such as topics 4 and 27, the results were generally high. On
average, our systems, in particular the ones based on Xapian and Lucene, were
as good as the median system for both P@10 and inferred NDCG.
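
For reference, P@10 is the fraction of the ten highest-ranked documents that are relevant. A minimal sketch, with made-up document ids and relevance judgments:

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Toy ranking of 20 documents; doc_15 is relevant but ranked below the top 10.
ranked = [f"doc_{i}" for i in range(1, 21)]
relevant = {"doc_2", "doc_5", "doc_15"}
p10 = precision_at_k(ranked, relevant)
```

A topic for which no relevant document appears in the top 10 therefore has P@10 equal to zero, as happened for most systems on the hardest topics.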

In general, Run1 was our worst run, performing much worse than the others.
Run5 also did not perform well, which can be explained by the fact that only
the smallest indexing strategy (MeSH-CD) and Metamap queries were used for
this system. In any case, a detailed investigation of the performance of these
two runs needs to be carried out in the future. Run2 and Run3 were our best
runs: Run3 had slightly better performance for P@10, but Run2 was better for
inferred NDCG. Run4 was stable enough to perform relatively well despite the
very poor performance of Run1. Table 2 compares the averaged results for all 5
runs with those of the median and the best system for each topic.


Runs    P@10  infNDCG  infAP  RPrec
Best    0.71  0.520    0.180  0.350
Median  0.23  0.151    0.032  0.126
Run1    0.02  0.017    0.001  0.007
Run2    0.28  0.193    0.057  0.174
Run3    0.29  0.171    0.042  0.152
Run4    0.23  0.152    0.033  0.141
Run5    0.14  0.059    0.009  0.040


Table  2:  Results  averaged  over  the  30  topics  for  each  of  our  5  runs,  the  Best  and 
Median  system. 


⁶ https://code.google.com/p/word2vec/


[Figure panels: Inferred NDCG; Precision @10]
Fig. 3: Precision at 10 and Inferred NDCG for all 30 topics.


4  Conclusion  and  Future  Work 

Improving search systems for health-related documents is an important
challenge for information retrieval researchers. In this work, we focused on
creating a robust baseline system, testing different search models, indexing
alternatives and possible ensembles.

Our experiments have shown that Lucene, using the Vector Space Model, and
Xapian, using BM25, had very similar performance. An ensemble of these two
may lead to better results, which we leave for future work. Investigating
what caused the Language Model of Indri to perform so badly is also left as
important future work.